1 Introduction

This is a general template on how you can analyze your RNA-Seq data using the DESeq2-package and the DESeqtools-package, which provides additional plots and some convenient wrappers.

The aim was to write a script that can be used as a template for standard bulk RNA-seq analysis. For this example analysis, a RNASeq dataset of human isolated monoctyes is used, which were differentiated to monocyte-derived dendritic cells (moDC) in presence or absence of interleukin-10 (IL-10). In addition, cell maturation by application of LPS leads to a total number of 4 conditions with 3 samples each. The raw data of the 12 samples is accessible under https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92852.

The data has been preprocessed using kallisto.

This example analysis script covers:

  1. Installing and loading the DESeqtools package

  2. Project Information

Here, you are supposed to write down all necessary information about the project you are working on! We think this is very important for collaboration partners or other researches working with the data (after your time) to know the background of the project and the preprocessing steps performed!

  1. Obligatory Data Structure

The data to be analyzed has to be in a certain structure, which is pretty much the structure it has coming out of out the preprocessing pipeline. Additionally, you need to download annotation files from a public sciebo folder.

  1. Data Import
  • Gene and Sample Annotation files
  • Fastq and Kallisto Quality Check
  • Kallisto Import using TXimport
  1. Building the DESeq Dataset
  • Pre-Filtering
  • Statistical Calculations
  • Produce normalized Count table
  • Produce variance-stabilized count table (rlog)
  1. Exploratory Data Analysis
  • Gene Types
  • Highest Expressed Genes
  • Heatmaps of all and variable Genes
  • Correlation and Distance Plots
  • PCA
  • Single Gene Expression Box Plot
  • Heatmap of selected Gene Sets without enrichment
  1. Batch Effect Removal (treat with caution!)

In case you observed an obvious batch effect in the data during the exploratory data analysis, this part of the script allows you to estimate the influence of both known and hidden batch effects in a PCA. Batch variables can later be included in the DESeq2 design.

  1. Differential Expression analysis
  • Automated DE analysis of specified comparisons
  • Produces a big list of all the output of the differential expression analysis!
  • Overall visualization of DE results
  • Heatmap of the union of all DE genes
  • Venn diagrams
  • Ratio-Ratio Plots
  • Ranked FC plots v. GSEA across comparisons (incl. GO & KEGG)
  • Likelihood Ratio Test
  • Analysis of specific comparisons
  • MA plots
  • P value distribution
  • Heatmap of DE genes
  • Volcano Plot v. GSEA (Incl. GO, KEGG, Hallmark, canonicalPathways, Motifs, ImmunoSignatures)
  1. Export of the results

  2. Save image and session Info

Some explanations in this script are borrowed from the great vignette of the DESeq2-package. You can find the full documentatio here: http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

2 Installing and loading the DESeqtools package

First, you need to make sure that you have the devtools package installed. Then, you can install the DESeqtools package from Github, which will automatically install all package dependencies and annotation files.

#install.packages("devtools")
# library("devtools")
#install_github("deniznazs/DESeqtools", ref = "package")

To load the package, run:

library(devtools)
## Warning: package 'devtools' was built under R version 3.6.1
## Loading required package: usethis
devtools::load_all(path = "E:/Deniz/Git/usr/bin/DESeqtools_Deniz")
## Loading DESeqtools
## 
## Registered S3 method overwritten by 'enrichplot':
##   method               from
##   fortify.enrichResult DOSE
## 
## 
## Warning in (function (dep_name, dep_ver = "*") : Need RColorBrewer >= 2.0.0
## but loaded version is 1.1-2
## Loading required package: pheatmap
## Loading required package: DESeq2
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter,
##     Find, get, grep, grepl, intersect, is.unsorted, lapply, Map,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
##     setdiff, sort, table, tapply, union, unique, unsplit, which,
##     which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
## 
##     windows
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## Loading required package: BiocParallel
## 
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
## 
##     aperm, apply, rowsum
## Loading required package: tidyr
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## Loading required package: ggplot2
## Loading required package: ggrepel
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:IRanges':
## 
##     space
## The following object is masked from 'package:S4Vectors':
## 
##     space
## The following object is masked from 'package:stats':
## 
##     lowess
## Loading required package: ggbeeswarm
## Loading required package: hexbin
## Loading required package: reshape2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
## Loading required package: factoextra
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:Biobase':
## 
##     contents
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## Loading required package: VennDiagram
## Loading required package: grid
## Loading required package: futile.logger
## Loading required package: openxlsx
## Loading required package: rhdf5
## Loading required package: clusterProfiler
## clusterProfiler v3.12.0  For help: https://guangchuangyu.github.io/software/clusterProfiler
## 
## If you use clusterProfiler in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
## 
## Attaching package: 'clusterProfiler'
## The following object is masked from 'package:DelayedArray':
## 
##     simplify
## Loading required package: GSEABase
## Loading required package: annotate
## Loading required package: AnnotationDbi
## Loading required package: XML
## Loading required package: graph
## 
## Attaching package: 'graph'
## The following object is masked from 'package:XML':
## 
##     addNode
## Loading required package: RColorBrewer
## Loading required package: ComplexHeatmap
## ========================================
## ComplexHeatmap version 2.0.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite:
## Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##   genomic data. Bioinformatics 2016.
## ========================================
## Loading required package: tximport
## Loading required package: vsn
## Loading required package: genefilter
## 
## Attaching package: 'genefilter'
## The following object is masked from 'package:ComplexHeatmap':
## 
##     dist2
## The following objects are masked from 'package:matrixStats':
## 
##     rowSds, rowVars
## Loading required package: biomaRt
## Loading required package: limma
## 
## Attaching package: 'limma'
## The following object is masked from 'package:DESeq2':
## 
##     plotMA
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
## Loading required package: sva
## Loading required package: mgcv
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:IRanges':
## 
##     collapse
## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
## 
## Attaching package: 'mgcv'
## The following object is masked from 'package:futile.logger':
## 
##     scat
## Loading required package: IHW
## 
## Attaching package: 'IHW'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
## Loading required package: org.Mm.eg.db
## Loading required package: org.Hs.eg.db
## Loading required package: ggpubr
## Warning: package 'ggpubr' was built under R version 3.6.1
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:VennDiagram':
## 
##     rotate
## Loading required package: useful
## Warning: package 'useful' was built under R version 3.6.1
## Loading required package: checkmate
## 
## Attaching package: 'checkmate'
## The following object is masked from 'package:matrixStats':
## 
##     anyMissing
## The following object is masked from 'package:Biobase':
## 
##     anyMissing
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:nlme':
## 
##     collapse
## The following object is masked from 'package:biomaRt':
## 
##     select
## The following objects are masked from 'package:GSEABase':
## 
##     intersect, setdiff, union
## The following object is masked from 'package:graph':
## 
##     union
## The following object is masked from 'package:AnnotationDbi':
## 
##     select
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following object is masked from 'package:matrixStats':
## 
##     count
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following objects are masked from 'package:GenomicRanges':
## 
##     intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':
## 
##     intersect
## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: plotly
## Warning: package 'plotly' was built under R version 3.6.1
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:biomaRt':
## 
##     select
## The following object is masked from 'package:ComplexHeatmap':
## 
##     add_heatmap
## The following object is masked from 'package:AnnotationDbi':
## 
##     select
## The following object is masked from 'package:Hmisc':
## 
##     subplot
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:IRanges':
## 
##     slice
## The following object is masked from 'package:S4Vectors':
## 
##     rename
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
devtools::document()
## Updating DESeqtools documentation
## Warning: @format [E:\Deniz\Git\usr\bin\DESeqtools_Deniz\R\data.R#244]:
## requires a value
## Writing NAMESPACE
## Loading DESeqtools
## Warning in (function (dep_name, dep_ver = "*") : Need RColorBrewer >= 2.0.0
## but loaded version is 1.1-2
## Writing NAMESPACE

Since the gene ontology files are too large to put them on github, you have to load them in two chunks and combine them.

# for human
GO_hs <- rbind(GO_hs_1, GO_hs_2)
# for mouse
GO_mm <- rbind(GO_mm_1, GO_mm_2)

3 Project Information

Please specify important information on the data set.

  • Scienfitic question:

  • IDs: 1553-1560
  • Samples: 12
  • Conditions: Immatur moDCs, LPS-treated moDCs, Immature IL-10-APCs, LPS-treated IL-10-APCs
  • Species: human
  • Cell type(s): moDCs, IL10-APCs
  • RNA-isolation:
  • Library-Production:
  • Sequencing method:
  • Sequencing run(s):
  • Alignment:
  • QC:

  • Project directories on Illumina (fastq files):
  • Project directory on blades (processed files):

Workflow Summary:

Space for notes concerning the global workflow and quality checks.

4 Obligatory Data Structure

Obligatory structure of your directory:

The project directory needs to contain a folder /Data, which contains:

  1. The output folder of the bulk RNAseq Kallisto pipeline including the folders kallisto and qc
  • The kallisto output files should be in: Data/output/kallisto/kallisto/
  • The Quality Control files should be in: Data/output/kallisto/mutiqc/multiqc_data/multiqc_kallisto.txt and Data/output/qc/multiqc/multiqc_data/multiqc_fastqc.txt.
  1. A sample table: sample_table.txt, which should contain at least the following columns:
  • Sample Identifier (e.g. 3609): “ID”
  • Condition (e.g. wt_8mo): “condition” –> merged sample information that is used for DE calling
  • Optional additional information: gender, age, date of experiment, …

The gene annotation files as well as the GMT files for human and mouse are loaded with the package.

Specify the organism of the data set

organism = "human" # choose "mouse" or "human"

Specify the project directory here:

If you want to run the script on your own data, you can specify the directory and save it unrer “dir”. Then, the output directories are automatically created.

dir <- "E:/Deniz/Git/usr/bin/DESeqtools_Deniz/inst/extdata"

# creating output directories (if not already existing):
dir.create(file.path(dir, "Analysis", "Tables"), recursive = T)
dir.create(file.path(dir, "Analysis", "Plots"), recursive = T)

For demonstation purposes, we take data which is installed with the package and therefore use another file directory that points to the preinstalled files.

dir <- system.file("extdata", package = "DESeqtools")

4.1 Colour scheme customization

Define you colour schemes according to your data set and your variables!

For hex colors use: https://www.color-hex.com

# cell_Type
col_cellType <- c("#81c354", "#86BE9E")
names(col_cellType) <- c("moDC", "IL10APC") 
# treatment
col_treat = c("#CCBD80","#FFEDA0", "#FEB24C", "#F03B20")
names(col_treat) <- c("undiff", "LPS","IL10", "LPS_IL10")
# donor
col_donor <- c("#66C3D0", "#008DD2","#0066D2")
names(col_donor) <- c("BC2", "BC3","BC5")
# merged: cell_Type and treatment
col_condition <- c(brewer.pal(3, "Blues")[2:3], brewer.pal(3, "Oranges")[2:3])
names(col_condition) <- c("Immature_moDCs","LPStreated_moDCs","Immature_IL10APCs","LPStreated_IL10APCs")

# combine color code into list
ann_colors_new <- list(cell_Type = col_cellType, 
                   treatment=col_treat,
                  donor = col_donor, 
                  condition=col_condition)

5 Data Import

5.1 Load gene annotation

This gene annotation file will be used 1) to map the Ensembl transcript IDs to Ensembl gene IDs during the tximport function and 2) annotate the Ensembl IDs with additional information such as gene symbol or type.

For consistency, this file should have been produced from the .gtf file used for building the kallisto index. It needs to consist of four columns: Gene ENSEMBL ID, Transcript ID, Gene Symbol, Gene Type. ## Load sample table

Now, we read a table containing all available metainformation for the samples. This table needs to be prepared beforehand from information on the sequencing tracker or provided by the experimental partners. Usually, you would read this from your local computer. Here, we just that the example sampletable, that comes with the DESeqtools-package.

sample_table <- example_sampletable
rownames(sample_table) <- sample_table$ID

5.1.1 Format sample table

For the correct order of the samples in later plots, we define the column of our sample table that contains the information of interest and reorder its factor levels.

## Add columns with factors for comparisons in model
sample_table$condition <- factor(sample_table$condition,
                                 levels = c("Immature_moDCs", "Immature_IL10APCs", "LPStreated_IL10APCs", "LPStreated_moDCs"))

# order according to factor of interest
sample_table <- sample_table[order(sample_table$condition),]

# define factor for order of samples in plotting 
plot_order <- "condition"

5.2 Quality check

The following visualizations of the data quality check results are only an addition to the much more comprehensive quality plots provided in the MultiQC .html files.

Please check the multiQC output thoroughly before you perfrom any analysis.

5.2.1 Fastq QC

Read and visualize fastQC MultiQC output.

For many projects, multiple sequencing runs are necessary to get a satisfactory sequencing depth for all samples. Furthermore, samples can be distributed onto multiple lanes of a flowcell.Since the quality of each single run and lane can differ, we want to look at all fastq files from all runs and lanes seperately.

For more information on fastQC, please check: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

fastq_qc <- read.delim(file.path(dir, "Data", "output", "qc", "multiqc", "multiqc_data",
                                 "multiqc_fastqc.txt"),
                       stringsAsFactors = F)

fastq_qc$ID <- unlist(lapply(strsplit(fastq_qc$Sample, split = "_"), `[[`, 4)) # sample identifiers
fastq_qc$run <- unlist(lapply(strsplit(fastq_qc$Sample, split = "_"), `[[`, 2)) # run identifiers
fastq_qc$lane <- unlist(lapply(strsplit(fastq_qc$Sample, split = "_"), `[[`, 6)) # lane identifiers

rownames(fastq_qc) <- paste(fastq_qc$run, fastq_qc$ID, fastq_qc$lane, sep = "_")

Visualization: Heatmap of fastQC quality checks

fastQC_HC <- t(fastq_qc[,colnames(fastq_qc) %in% c("per_base_sequence_quality",
                                              "per_sequence_quality_scores",
                                              "per_base_sequence_content",
                                              "per_sequence_gc_content",
                                              "per_base_n_content",
                                              "sequence_length_distribution",
                                              "sequence_duplication_levels",
                                              "overrepresented_sequences",
                                              "adapter_content")])

draw(Heatmap(fastQC_HC,
        c("firebrick2", "forestgreen", "goldenrod1"),
        column_title = "FastQC results"),
     heatmap_legend_side = "left")

5.2.2 Kallisto QC

Read and visualize kallisto MultiQC output.

Before alignment, multiple fastq files from multiple lanes or runs for a single sample are merged and aligned as one sample. Thus, we have only one alignment score per sample.

Since we want to later visualize the alignment statistics in analysis plots, we add this information to our sample table.

kallisto_qc <- read.delim(file.path(dir, "Data", "output", "kallisto", "multiqc-kallisto", "multiqc_data", "multiqc_kallisto.txt"), stringsAsFactors = F)

# change sample column 
kallisto_qc$Sample <- do.call(rbind, strsplit(kallisto_qc$Sample, split = "\\_"))[,1]

# Merge kallisto QC to sample table
sample_table <- merge(sample_table, kallisto_qc, by.x = "ID", by.y = "Sample")

Visualization of pseudoaligned and total reads:

alignstat <-sample_table[,c("ID", "pseudoaligned_reads","total_reads")]
alignstat <-gather(alignstat, "total_reads", "pseudoaligned_reads", key = "type", value = "reads")

ggplot(alignstat, aes(x = reorder(ID,-reads), y = reads, fill = type)) + 
  geom_bar(stat = "identity", position = "dodge", colour = "black") + 
  scale_y_continuous(breaks = c(c(5 , 10, 25, 50, 100) * 10^6))+
  scale_fill_manual(name = "", values = c("white", "grey")) + 
  ylab("Reads") + 
  xlab("Sample IDs") + 
  geom_hline(yintercept=5 * 10^6, linetype="dashed", color = "red") +
  ggtitle("Total reads and pseudoaligned reads") +
  theme(axis.text.x = element_text(angle = 70, size = 9, hjust = 1, vjust = 1), 
        strip.background = element_rect(fill = "white"), 
        panel.background = element_rect(fill = "white", colour = "black"), 
        legend.position = "bottom"
  )

5.2.3 Exclude samples after QC (based e.g. on number of unique alignments, percent of aligned reads?)

In case, the data set contains samples of poor quality or simply to few reads, we might want to exclude these samples. In the following chunks, all samples with less than 5.000.000 reads are excluded from the subsequent analysis.

# define read cutoff
samples_to_keep <- sample_table[sample_table$pseudoaligned_reads > 5000000,]$ID

# make reduced sampleTable file 
sample_table <- sample_table[which(sample_table$ID %in% samples_to_keep),]
rownames(sample_table) <- sample_table$ID

5.3 TXimport

A new and recommended pipeline for RNA-seq analysis is to use fast transcript abundance quantifiers, such as kallisto or Salmon, upstream of DESeq2, and then to create gene-level count matrices for use with DESeq2 by importing the quantification data using the tximport package.

We use tximport and DESeq2 based on the gene-level estimated counts from Kallisto (Bray, Pimentel, Melsted, Pachter 2016).

Some advantages of using this methods for transcript abundance estimation are:

  • this approach corrects for potential changes in gene length across samples (e.g. from differential isoform usage) (Trapnell et al. 2013),
  • some of these methods (Salmon, Sailfish, kallisto) are substantially faster and require less memory and disk usage compared to alignment-based methods that require creation and storage of BAM files, and,
  • it is possible to avoid discarding those fragments that can align to multiple genes with homologous sequence, thus increasing sensitivity (Robert and Watson 2015).

Full details on the motivation and methods for importing transcript level abundance and count estimates, summarizing to gene-level count matrices and producing an offset which corrects for potential changes in average transcript length across samples are described in (Soneson, Love, and Robinson 2015). Note that the tximport-to-DESeq2 approach uses estimated gene counts from the transcript abundance quantifiers, but not normalized counts.

TXimport imports transcript-level abundance, estimated counts and transcript lengths, and summarizes these into matrices for use with downstream statistical analysis packages such as edgeR, DESeq2, limma-voom. Average transcript length, weighted by sample-specific transcript abundance estimates, is provided as a matrix, which is then used as an offset for different expression of gene-level counts.

# Define path where the Kallisto files are stored
files <- paste(dir, "/Data/output/kallisto/kallisto/", sample_table$ID, "/abundance.h5", sep = "")
# Naming the entries in the vector assures correct column names in the expression tables
names(files) <- sample_table$ID
# choose your tx2gene-annotation based on the annotation you used for kallisto
tx_annotation <- tx_annotation_v27_human

# make sure sample annotation and counts are in the same order
identical(rownames(sample_table), names(files))
## [1] TRUE
# reorder the files to import
files <- files[as.character(sample_table$ID)]
identical(rownames(sample_table), names(files)) # should be true now
## [1] TRUE
# Import samples and perform the distribution of transcripts to genes
txi_kallisto <- tximport(files, 
                         type="kallisto", 
                         tx2gene=tx_annotation[,2:1])
## 1 2 3 4 5 6 7 8 9 10 11 12 
## summarizing abundance
## summarizing counts
## summarizing length

6 Building the DESeq Dataset

The object class used by the DESeq2 package to store the read counts and the intermediate estimated quantities during statistical analysis is the DESeqDataSet, which will usually be represented in the code here as an object called “dds”.

A DESeqDataSet object must have an associated design formula. The design formula expresses the variables which will be used in modeling. The formula should be a tilde (~) followed by the variables with plus signs between them (it will be coerced into a formula if it is not already). The design can be changed later, however then all differential analysis steps should be repeated, as the design formula is used to estimate the dispersions and to estimate the log2 fold changes of the model.

dds_txi <- DESeqDataSetFromTximport(txi = txi_kallisto, 
                                    colData = sample_table,
                                    design = ~ condition)
## using counts and average transcript lengths from tximport

6.1 Pre-filtering

While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: * by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and * we increase the speed of the transformation and testing functions within DESeq2.

Here, we perform a minimal pre-filtering to keep only rows that have at least 10 reads total.

Note that more strict filtering to increase power is automatically applied via independent filtering or independent hypothesis weighting on the mean of normalized counts within the results function.

genes_to_keep <- rowSums(counts(dds_txi)) >= 10
dds <- dds_txi[genes_to_keep,]

Number of genes after filtering is: 24016

6.2 DESeq calculations

DESeq2: Estimate variance-mean dependence (2) in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution (1).

  1. In inferential testing a distributional assumption is needed because we want to estimate the probability of extreme events just appearing by chance (e.g. large fold change) from limited replicates. The test statistic of ANOVA (or t-test) follows a Student’s t distribution, a specific case of the normal distribution. However, counts, as produced in RNA-seq experiments, cannot be normally distributed by definition(you can’t have -3 counts, or 12.2 counts). Two distributions for count based data are Poisson, which presumes that the variance and mean are equal, or negative binomial, a.k.a. Gamma-Poisson, which does not. The spread of values among biological replicates is more than given by the one parameter Poisson distribution and it seems to be captured by the two parameter (mean & variance) NB sufficiently well.

  2. Information sharing across genes for variance estimation:
    In order to test the differential expression of a gene, we need to estimate its mean and variance for the underlying negative binomial distribution. Inferential methods that treat each gene separately suffer from the high uncertainty of within-group variance estimates. However, this limitation can be overcome by pooling information across genes, specifically, by exploiting assumptions about the similarity of the variances of different genes measured in the same experiment . DESeq2 detects and corrects dispersion estimates through modeling of the dependence of the dispersion on the average mean over all samples.

The standard differential expression analysis steps are wrapped into a single function, DESeq(). The estimation steps performed by this function are described in the manual page for ?DESeq and in the Methods section of the DESeq2 publication (Love, Huber, and Anders 2014).

This function performs a default analysis through the steps:

  1. Estimation of size factors: estimateSizeFactors

  2. Estimation of dispersion: estimateDispersions

  3. Negative Binomial GLM fitting and Wald statistics: nbinomWaldTest

For complete details on each step, see the manual pages of the respective functions.

dds <- DESeq(dds)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing

6.3 Normalized count table

For inspection of the normalized data, we write the normalized counts into a data.frame called “norm_anno”.

norm_anno <- as.data.frame(counts(dds, normalized=T))
unnorm_anno<-as.data.frame(counts(dds, normalized=F))#Deniz

norm_anno$GENEID <- row.names(norm_anno)
unnorm_anno$GENEID <- row.names(unnorm_anno)#Deniz

# add gene annotation extracted from the gtf file
gene_annotation <- tx_annotation[!duplicated(tx_annotation$GENEID),c("GENEID", "SYMBOL", "GENETYPE")]
gene_annotation <- gene_annotation[match(rownames(norm_anno), gene_annotation$GENEID),]
gene_annotation <- gene_annotation[match(rownames(unnorm_anno), gene_annotation$GENEID),]#Deniz

# check if row names of the normalized table and the gene annotation match perfectly
all(rownames(norm_anno) == gene_annotation$GENEID)
## [1] TRUE
all(rownames(unnorm_anno) == gene_annotation$GENEID)#Deniz
## [1] TRUE
# add additional gene annotation downloaded from biomart
if(organism == "human"){
  biomart <- biomart_human
} else if( organism == "mouse"){
 biomart <- biomart_mouse
  }

idx <- match(unlist(lapply(strsplit(gene_annotation$GENEID, split = "[.]"), `[[`, 1)), biomart$Gene.stable.ID)
gene_annotation$DESCRIPTION <- biomart$Gene.description[idx]
gene_annotation$CHR <- biomart$Chromosome.scaffold.name[idx]

# merge expression table and annotation
norm_anno <- merge(norm_anno,
                   gene_annotation,
                   by = "GENEID")
rownames(norm_anno) <- norm_anno$GENEID

norm_anno[1:3,c(1:2, (ncol(norm_anno)-5):ncol(norm_anno))]
##                                GENEID      1553       1567       1568
## ENSG00000000003.14 ENSG00000000003.14   0.00000   1.236363   0.803422
## ENSG00000000419.12 ENSG00000000419.12 312.45092 275.259969 245.334723
## ENSG00000000457.13 ENSG00000000457.13  76.17025  83.453860  76.727878
##                    SYMBOL       GENETYPE
## ENSG00000000003.14 TSPAN6 protein_coding
## ENSG00000000419.12   DPM1 protein_coding
## ENSG00000000457.13  SCYL3 protein_coding
##                                                                                                       DESCRIPTION
## ENSG00000000003.14                                              tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## ENSG00000000419.12 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
## ENSG00000000457.13                                   SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
##                    CHR
## ENSG00000000003.14   X
## ENSG00000000419.12  20
## ENSG00000000457.13   1
#Deniz
unnorm_anno <- merge(unnorm_anno,
                   gene_annotation,
                   by = "GENEID")
rownames(unnorm_anno) <- unnorm_anno$GENEID

unnorm_anno[1:3,c(1:2, (ncol(unnorm_anno)-5):ncol(unnorm_anno))]
##                                GENEID 1553 1567 1568 SYMBOL       GENETYPE
## ENSG00000000003.14 ENSG00000000003.14    0    1    1 TSPAN6 protein_coding
## ENSG00000000419.12 ENSG00000000419.12  182  282  381   DPM1 protein_coding
## ENSG00000000457.13 ENSG00000000457.13   48   88  131  SCYL3 protein_coding
##                                                                                                       DESCRIPTION
## ENSG00000000003.14                                              tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## ENSG00000000419.12 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
## ENSG00000000457.13                                   SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
##                    CHR
## ENSG00000000003.14   X
## ENSG00000000419.12  20
## ENSG00000000457.13   1

6.4 Boxplot of normalized and unnorm gene expression

#Boxplot and comparison btw unnorm_anno vs norm_anno

######first, create a sample table just taking the sample ID and condition for boxplot
box_sample_table<-sample_table[,c("ID","condition")]#depending on choice, change the condition to smt else

#ordering for the legend labels
legend_order <- factor(unique(sample_table$condition),
                 levels = unique(sample_table$condition))
legend_order<-levels(legend_order)

######norm_anno boxplot
box_norm_table<-norm_anno[,colnames(norm_anno)%in%box_sample_table$ID]
box_norm_table$GENEID<-rownames(box_norm_table)

###restructuring the table for ggplot2 analysis w/ melt function 

box_norm_table<-melt(box_norm_table, id.vars=c("GENEID"))
colnames(box_norm_table)<-c("GENEID","sample","expression")
box_norm_table<-merge(box_norm_table,box_sample_table,by.x="sample",by.y="ID")

box_norm_table$condition<-factor(box_norm_table$condition, levels=legend_order)

p1<-ggplot(box_norm_table, mapping = aes(x=sample , y= expression+1,fill=condition))+
  geom_boxplot()+
  #geom_boxplot(outlier.shape = NA)+ #to make the outlier points transparent
  scale_y_log10()+
  scale_fill_manual(values = col_condition)+
  #scale_fill_discrete(breaks=fill)+
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90))+
  xlab("Samples") + ylab("normalized expression") + 
  ggtitle("Normalized gene expression analysis") 



######unnorm_anno boxplot
box_unnorm_table<-unnorm_anno[,colnames(unnorm_anno)%in%box_sample_table$ID]
box_unnorm_table$GENEID<-rownames(box_unnorm_table)

###restructuring the table for ggplot2 analysis w/ melt function 

box_unnorm_table<-melt(box_unnorm_table, id.vars=c("GENEID"))
colnames(box_unnorm_table)<-c("GENEID","sample","expression")
box_unnorm_table<-merge(box_unnorm_table,box_sample_table,by.x="sample",by.y="ID")

box_unnorm_table$condition<-factor(box_unnorm_table$condition, levels=legend_order)

p2<-ggplot(box_unnorm_table, mapping = aes(x=sample , y= expression+1,fill=condition))+
  geom_boxplot()+
  #geom_boxplot(outlier.shape = NA)+ #to make the outlier points transparent
  scale_y_log10()+ #to change the y-axis limit
  scale_fill_manual(values = col_condition)+
  #scale_fill_discrete(breaks=fill)+
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90),legend.position="none")+
  xlab("Samples") + ylab("unnormalized expression") + 
  ggtitle("Unnormalized gene expression analysis") 

#combining two boxplots into one plot with common legend
library(ggpubr)
ggarrange(p2, p1, ncol=2, common.legend = TRUE, legend="bottom")

6.5 Variance stabilizing transformation

In order to test for differential expression, we operate on raw counts and use discrete distributions. However for other downstream analyses - e.g. for visualization or clustering - it might be useful to work with transformed versions of the count data.

Maybe the most obvious choice of transformation is the logarithm. Since count values for a gene can be zero in some conditions (and non-zero in others), some advocate the use of pseudocounts, i.e. transformations of the form: y=log2(n+n0) where n represents the count values and n0 is a positive constant.

DESeq2 has two alternative approaches that offer more theoretical justification and a rational way of choosing parameters equivalent to n0 above. One makes use of the concept of variance stabilizing transformations (VST) (Tibshirani 1988; Huber et al. 2003; Anders and Huber 2010), and the other is the regularized logarithm or rlog, which incorporates a prior on the sample differences (Love, Huber, and Anders 2014). Both transformations produce transformed data on the log2 scale which has been normalized with respect to library size or other normalization factors.

The point of these two transformations, the VST and the rlog, is to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low. Both VST and rlog use the experiment-wide trend of variance over mean, in order to transform the data to remove the experiment-wide trend. Note that we do not require or desire that all the genes have exactly the same variance after transformation. Indeed, in a figure below, you will see that after the transformations the genes with the same mean do not have exactly the same standard deviations, but that the experiment-wide trend has flattened. It is those genes with row variance above the trend which will allow us to cluster samples into interesting groups.

The two functions, vst and rlog have an argument blind, for whether the transformation should be blind to the sample information specified by the design formula. When blind equals TRUE (the default), the functions will re-estimate the dispersions using only an intercept. This setting should be used in order to compare samples in a manner wholly unbiased by the information about experimental groups, for example to perform sample QA (quality assurance) as demonstrated below.

However, blind dispersion estimation is not the appropriate choice if one expects that many or the majority of genes (rows) will have large differences in counts which are explainable by the experimental design, and one wishes to transform the data for downstream analysis. In this case, using blind dispersion estimation will lead to large estimates of dispersion, as it attributes differences due to experimental design as unwanted noise, and will result in overly shrinking the transformed values towards each other. By setting blind to FALSE, the dispersions already estimated will be used to perform transformations, or if not present, they will be estimated using the current design formula. Note that only the fitted dispersion estimates from mean-dispersion trend line are used in the transformation (the global dependence of dispersion on mean for the entire experiment). So setting blind to FALSE is still for the most part not using the information about which samples were in which experimental group in applying the transformation.

# Please choose to use rlog or VST for the transformation: rlog is recommended for less than 30 samples, vst for more than 30 samples for the sake of computing time

dds_vst <- rlog(dds, blind = TRUE)

# dds_vst <- vst(dds, blind = TRUE)

For later plotting of Heatmaps, it may be useful to have a dataframe of the variance stabilized values as log2-transformed (vst_anno_log2) as well as as normal counts (vst_anno):

vst_anno_log2<-as.data.frame(assay(dds_vst))
vst_anno_log2<-merge(vst_anno_log2,norm_anno[,c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")],by="row.names")
rownames(vst_anno_log2)<-vst_anno_log2$Row.names
vst_anno_log2$Row.names<-NULL

vst_anno<-as.data.frame(assay(dds_vst))
vst_anno<-2^vst_anno
vst_anno<-merge(vst_anno,norm_anno[,c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")],by="row.names")
rownames(vst_anno)<-vst_anno$Row.names
vst_anno$Row.names<-NULL

6.5.1 Plot row standard deviations versus row means

meanSdPlot(as.matrix(assay(dds_vst)), ranks = FALSE)


7 Exploratory Data Analysis

7.0.0.1 Sample annotation and colour for plotting

# choose columns from sampletable for heatmap annotation
plot_annotation <- sample_table[,c("condition","donor","pseudoaligned_reads"), drop = F]

rownames(plot_annotation) <- sample_table$ID
# choose colours for your conditions
ann_colors <- setColours(factorcolours = c("Set1","Set2"))

7.1 Means for plotting

norm_mean<-mean_function(input=norm_anno, 
                         anno=sample_table,
                         condition="condition")#Deniz: depending on your desired condition, can change it and put any other column name from sample_table

vst_mean<-mean_function(input=vst_anno,
                        anno = sample_table,
                        condition = "condition")#Deniz: depending on your desired condition, can change it and put any other column name from sample_table

library(useful)
corner(vst_mean)
##                                GENEID Immature_moDCs LPStreated_moDCs
## ENSG00000000003.14 ENSG00000000003.14        2.01630         1.600599
## ENSG00000000419.12 ENSG00000000419.12      283.11881       298.194813
## ENSG00000000457.13 ENSG00000000457.13       91.28087       138.085578
## ENSG00000000460.16 ENSG00000000460.16       37.03521        26.622899
## ENSG00000000938.12 ENSG00000000938.12     2724.02226       922.685747
##                    Immature_IL10APCs LPStreated_IL10APCs
## ENSG00000000003.14          1.650472            1.630697
## ENSG00000000419.12        307.124451          256.273981
## ENSG00000000457.13        102.633810           97.864680
## ENSG00000000460.16         37.612088           36.503026
## ENSG00000000938.12       2727.889866         2113.536536

Mean sample table

# function to create mean_sample_table depending on your condition called mean_sample_definition

mean_sample_table<- sample_table[!duplicated(sample_table$condition),]#mean sample table is generated firsly from sample_table
rownames(mean_sample_table)<-mean_sample_table$condition
mean_sample_table<-subset(mean_sample_table,select=-c(ID,donor))#ID and donor columns are excluded

mean_sample_table<-mean_sample_definition(input=mean_sample_table,
                                          anno=sample_table,
                                          condition="condition")
## [1] "Immature_moDCs"
## [1] "LPStreated_moDCs"
## [1] "Immature_IL10APCs"
## [1] "LPStreated_IL10APCs"
mean_sample_table
##                               condition cell_Type treatment total_reads
## Immature_moDCs           Immature_moDCs      moDC    undiff     9786258
## LPStreated_moDCs       LPStreated_moDCs      moDC       LPS    12241293
## Immature_IL10APCs     Immature_IL10APCs   IL10APC      IL10     9674108
## LPStreated_IL10APCs LPStreated_IL10APCs   IL10APC  LPS_IL10    18533050
##                     pseudoaligned_reads not_pseudoaligned_reads
## Immature_moDCs                  7957867                 1828391
## LPStreated_moDCs               10284543                 1956751
## Immature_IL10APCs               8654087                 1020021
## LPStreated_IL10APCs            16387226                 2145823
##                     percent_aligned                  ID
## Immature_moDCs             80.92075      Immature_moDCs
## LPStreated_moDCs           83.96828    LPStreated_moDCs
## Immature_IL10APCs          89.24532   Immature_IL10APCs
## LPStreated_IL10APCs        88.37550 LPStreated_IL10APCs
corner(mean_sample_table)
##                               condition cell_Type treatment total_reads
## Immature_moDCs           Immature_moDCs      moDC    undiff     9786258
## LPStreated_moDCs       LPStreated_moDCs      moDC       LPS    12241293
## Immature_IL10APCs     Immature_IL10APCs   IL10APC      IL10     9674108
## LPStreated_IL10APCs LPStreated_IL10APCs   IL10APC  LPS_IL10    18533050
##                     pseudoaligned_reads
## Immature_moDCs                  7957867
## LPStreated_moDCs               10284543
## Immature_IL10APCs               8654087
## LPStreated_IL10APCs            16387226

Specify mean sample annotation for plotting

# choose columns from sampletable for heatmap annotation
#consistency for writing in capital or small letters!
mean_plot_annotation <- mean_sample_table[,c("condition","pseudoaligned_reads"), drop = F]

rownames(mean_plot_annotation) <- mean_sample_table$condition

# plot_order

mean_sample_table$condition<- factor(mean_sample_table$condition,
                                    levels=c("Immature_moDCs","LPStreated_moDCs","Immature_IL10APCs","LPStreated_IL10APCs"))

plot_order<-"condition"

7.2 Frequencies of gene types

TypeCounts <- as.data.frame(table(norm_anno$GENETYPE))
colnames(TypeCounts) <- c("Type", "Frequency")
TypeCounts <- subset(TypeCounts, Frequency>0)

ggplot(TypeCounts, aes(x=Type, y= Frequency,  label=Frequency))+
  geom_bar(stat="identity",fill="grey",colour="grey") +
  theme_bw()+
  geom_text(size = 3, position = position_stack(vjust = 1))+
  guides(fill=FALSE)+
  theme(text = element_text(size=10),axis.text.x = element_text(angle =90, hjust = 1))+
  xlab("")

7.3 Histogram of of means per gene over all samples

rMeans <- as.data.frame(log(rowMeans(counts(dds, normalized=TRUE)),10))
colnames(rMeans) <- "rowMeans"
ggplot(rMeans, aes(x = rowMeans)) + 
  geom_histogram(bins=100) +
  xlab("log10(rowMeans)")+
  scale_x_continuous(breaks=c(0,1,2,3,4,5,6))+
  theme_bw()

7.4 Boxplots of highest expressed genes

highestGenes(numGenes = 50)

7.5 Hierarchical Clustering of all present genes

Plot a heatmap of all hierarchically clustered present genes.

I case your machine has only 8 gb of ram, plotting a heatmap of all present genes might exceed your memory capacities and give the error: “Error: cannot allocate vector of size XX Gb”. In that case, you can either reduce the genes to present based on a higher expression cutoff or continue with the heatmap of variable genes.

7.5.1 clustered Columns & Rows-all samples

plotHeatmap(input=norm_anno,
            smp_table = sample_table,
            plot_anno = plot_annotation,
           
            geneset = "all",
            title = "Heatmap of all present genes (normalized counts)",
            show_rownames = FALSE,
            cluster_cols = TRUE,
            gene_type = "all")

plotHeatmap(input=vst_anno,
            smp_table = sample_table,
            plot_anno = plot_annotation,
           
            geneset = "all",
            title = "Heatmap of all present genes (log2 normalized counts)",
            show_rownames = FALSE,
            cluster_cols = TRUE,
            gene_type = "all")

7.5.2 clustered Columns & Rows-means of conditions

plotHeatmap(input=norm_mean,
            smp_table = mean_sample_table,
          
            plot_anno = mean_plot_annotation,
            geneset = "all",
            title = "Heatmap of all present genes (normalized counts)",
            show_rownames = FALSE,
            cluster_cols = TRUE,
            gene_type = "all")

plotHeatmap(input=vst_mean,
            smp_table = mean_sample_table,
    
            plot_anno = mean_plot_annotation,
            geneset = "all",
            title = "Heatmap of all present genes (log2 normalized counts)",
            show_rownames = FALSE,
            cluster_cols = TRUE,
            gene_type = "all")

7.5.3 clustered Rows-all samples

plotHeatmap(input=norm_anno ,
            smp_table = sample_table,
            plot_anno = plot_annotation,
           
            geneset = "all",
            title = "Heatmap of all present genes (normalized counts)",
            show_rownames = FALSE,
            cluster_cols = FALSE,
            gene_type = "all")

plotHeatmap(input=vst_anno ,
            smp_table = sample_table,
            plot_anno = plot_annotation,
           
            geneset = "all",
            title = "Heatmap of all present genes (log2 normalized counts)",
            show_rownames = FALSE,
            cluster_cols = FALSE,
            gene_type = "all")

7.5.4 clustered Rows- means of conditions

plotHeatmap(input=norm_mean,
            smp_table = mean_sample_table,
            plot_anno = mean_plot_annotation,
          
            geneset = "all",
            title = "Heatmap of all present genes (normalized counts)",
            show_rownames = FALSE,
            cluster_cols = FALSE,
            gene_type = "all")

plotHeatmap(input=vst_mean,
            smp_table = mean_sample_table,
            plot_anno = mean_plot_annotation,
       
            geneset = "all",
            title = "Heatmap of all present genes (log2 normalized counts)",
            show_rownames = FALSE,
            cluster_cols = FALSE,
            gene_type = "all")

##            used  (Mb) gc trigger   (Mb)   max used    (Mb)
## Ncells  8155733 435.6   15768767  842.2   15768767   842.2
## Vcells 55058708 420.1 1262836672 9634.7 1578545839 12043.4

7.6 Hierarchical Clustering of most variable genes

# define variable genes
rv = genefilter::rowVars(assay(dds_vst))
q75 = quantile(rowVars(assay(dds_vst)), .75)
q75_names = names(which(rv > q75))

7.6.1 clustered Columns & Rows- all samples

plotHeatmap(input=norm_anno,
            smp_table=sample_table,
           
            geneset =q75_names,
            title = "Heatmap of Top 75% most variable genes (log2 normalized counts)",
            show_rownames = FALSE,
            cluster_cols = TRUE,
            plot_anno=plot_annotation,
            gene_type = "all")

plotHeatmap(input=vst_anno,
            smp_table=sample_table,
           
            geneset =q75_names,
            title = "Heatmap of Top 75% most variable genes (log2 normalized counts)",
            show_rownames = FALSE,
            cluster_cols = TRUE,
            plot_anno=plot_annotation,
            gene_type = "all")

7.6.2 clustered Columns & Rows-means of conditions

plotHeatmap(input=norm_mean,
            smp_table=mean_sample_table,
         
            geneset =q75_names,
            title = "Heatmap of Top 75% most variable genes (log2 normalized counts)",
            show_rownames = FALSE,
            cluster_cols = TRUE,
            plot_anno=mean_plot_annotation,
            gene_type = "all")

7.6.3 clustered Rows-all samples

plotHeatmap(input=norm_anno, 
            smp_table=sample_table,
           
            geneset = q75_names,
            title = "Heatmap of all most variable genes",
            show_rownames = FALSE,
            cluster_cols = FALSE,
            plot_anno=plot_annotation,
            gene_type = "all")

7.6.4 clustered Rows-conditions

plotHeatmap(input=vst_mean, 
            smp_table=mean_sample_table,
           
            geneset = q75_names,
            title = "Heatmap of all most variable genes",
            show_rownames = FALSE,
            cluster_cols = FALSE,
            plot_anno=mean_plot_annotation,
            gene_type="all")

plotHeatmap(input=vst_mean, 
            smp_table=mean_sample_table,
            
            geneset = q75_names,
            title = "Heatmap of all most variable genes",
            show_rownames = FALSE,
            cluster_cols = FALSE,
            plot_anno=mean_plot_annotation,
            gene_type="antisense_RNA")

7.7 Sample-to-Sample correlation & distance

cd_input<-as.data.frame(assay(dds_vst))

7.7.1 Correlation

# Correlation based on variance-stabilized counts
# sampleCor <- as.matrix(cor(assay(dds_vst), use="all.obs", method="pearson"))
# rownames(sampleCor)<- sample_table$ID
# colnames(sampleCor)<- sample_table$ID
# 
# pheatmap(sampleCor,
#          main="Sample Correlation based on variance-stabilized counts",
#          annotation_row = plot_annotation,
#          annotation_col = plot_annotation,
#          annotation_colors = ann_colors,
#          cluster_rows = F,
#          cluster_cols = F,
#          fontsize = 8)

corr_function(sampleCor = cd_input,
               title="Sample Correlation based on variance-stabilized counts",
               gene_anno=gene_annotation,
               plot_anno=plot_annotation,
               gene_type="all",
               cluster_rows=F,
               cluster_cols=F,
               mean=F)

 corr_function(sampleCor = cd_input,
               title="Sample Correlation based on variance-stabilized counts",
               gene_anno=gene_annotation,
               plot_anno=mean_plot_annotation,
               gene_type="all",
               cluster_rows=F,
               cluster_cols=F,
               mean=T)

 corr_function(sampleCor =  cd_input,
               title="Sample Correlation based on variance-stabilized counts",
               gene_anno=gene_annotation,
               plot_anno=plot_annotation,
               gene_type="ribozyme",
               cluster_rows=F,
               cluster_cols=F,
               mean=F)

7.7.2 Distance

This function computes and returns the euclidean distance matrix between the rows of a data matrix, the samples in our case.

# Sample Distances based on variance-stabilized counts
# sampleDist <- as.matrix(dist(t(assay(dds_vst))))
# rownames(sampleDist)<- sample_table$ID
# colnames(sampleDist)<- sample_table$ID
# 
# pheatmap(sampleDist,
#          clustering_distance_rows = dist(t(assay(dds_vst))),
#          clustering_distance_cols = dist(t(assay(dds_vst))), 
#          main="Sample distances based on variance-stabilized counts per sample",
#          annotation_row = plot_annotation, 
#          annotation_col = plot_annotation,
#          annotation_colors = ann_colors,
#          fontsize = 8)

dist_function(sampleDist = cd_input,
              gene_anno=gene_annotation,
              plot_anno=plot_annotation,
              title="Sample distances based on variance-stabilized counts per sample",
              gene_type="all",
              mean=F)

dist_function(sampleDist = cd_input,
              gene_anno=gene_annotation,
              plot_anno=mean_plot_annotation,
              title="Sample distances based on variance-stabilized counts per sample",
              gene_type="all",
              mean=T)

7.8 Principle Component Analysis

Related to the distance matrix is the PCA plot, which shows the samples in the 2D plane spanned by two principal components. This type of plot is useful for visualizing the overall effect of experimental covariates and batch effects.

Principal component analysis (PCA) simplifies the complexity in high-dimensional data while retaining trends and patterns. It does this by transforming the data into fewer dimensions, which act as summaries of features. High-dimensional data are very common in biology and arise when multiple features, such as expression of many genes, are measured for each sample. This type of data presents several challenges that PCA mitigates: computational expense and an increased error rate due to multiple test correction when testing each feature for association with an outcome. PCA is an unsupervised learning method and is similar to clustering1—it finds patterns without reference to prior knowledge about whether the samples come from different treatment groups or have phenotypic differences.

To understand the basics of PCA, please watch: https://www.youtube.com/watch?v=_UVHneBUBW0

7.8.1 Percentage of explained variance of PCs

# Extract the eigenvalues/variances of the principal dimensions
eigenvalue <- get_eig(prcomp(t(assay(dds_vst))))
eigenvalue$dim <- as.numeric(c(1:nrow(eigenvalue)))

ggplot(eigenvalue, aes(dim, variance.percent))+ 
  geom_bar(stat = "identity")+
  geom_line(aes(dim, variance.percent)) +
  geom_point(aes(dim, variance.percent)) +
  geom_line(aes(dim, cumulative.variance.percent), colour= "grey") + 
  geom_point(aes(dim, cumulative.variance.percent), colour= "grey") + 
  scale_x_continuous(breaks=c(1:nrow(eigenvalue)))+
  xlab("Dimensions") +
  ylab("Percentage of explained variances") +
  theme_bw()

7.8.2 Plot PCA

7.8.2.1 Condition

plotPCA(ntop="all", 
        xPC=1, 
        yPC=2,
        color= "condition",
        anno_colour = ann_colors$condition,
        shape="NULL",
        point_size=3,
        gene_type = "all",
        label= "ID",
        title="Principle Component Analysis based on variance-stabilized counts",
        gene_anno = gene_annotation)

plotPCA(ntop="all", 
        xPC=1, 
        yPC=2,
        color= "condition",
        anno_colour = ann_colors$condition,
        shape="NULL",
        point_size=3,
        gene_type = "protein_coding",
        label= "ID",
        title="Principle Component Analysis based on variance-stabilized counts",
        gene_anno = gene_annotation)

7.8.2.2 Donor

plotPCA(ntop="all", 
        xPC=1, 
        yPC=2,
        color="donor",
        anno_colour=ann_colors$donor,
        shape="NULL",
        point_size=3,
        label= "ID",
        label_subset = "1553",
        title="Principle Component Analysis based on variance-stabilized counts")

7.8.2.3 Pseudoaligned Reads

plotPCA(ntop="all", 
         xPC=1, 
         yPC=2,
         color="pseudoaligned_reads",
         anno_colour= ann_colors$pseudoaligned_reads,
         shape="NULL",
         point_size=3,
         title="Principle Component Analysis based on variance-stabilized counts")

7.8.3 3D PCA

Please comment this chunk out when knitting to an HTML file because it will lead to a corrupt HTML file. #@Deniz: you can test this out as well :D

plot3D_pca(pca3d_input=dds_vst,
           sample_table=sample_table,
           gene_anno=gene_annotation,
           gene_type="all",
           xPC=1,
           yPC=2,
           zPC=3,
           ntop="all",
           anno_colour = col_condition,
           point_size =10)
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

7.8.4 PCA loadings

plotLoadings(PC="PC1", ntop="all")

plotLoadings(PC="PC2", ntop="all")

plotLoadings(PC="PC3", ntop="all")

7.9 Gene-to-Gene correlation analysis over all samples

7.9.1 Positive Pearson Correlation

In the following chunk, we define the upper quartile of the variable genes and identify those genes with a high positive correlation (r>0.95).

q75_vst = assay(dds_vst)[rowVars(assay(dds_vst)) > quantile(rowVars(assay(dds_vst)), .75),]

rcor <- rcorr(t(q75_vst),type="pearson")
rcor$sig <- rcor$P<0.05 & rcor$r>0 # define significant positive correlations
rcor_filt <- rcor$r*rcor$si
rcor_filt <- rcor_filt*upper.tri(rcor_filt)
rcor_filt<- replace(rcor_filt, which( rcor_filt==0), NA)
rcor_filt_melt <- melt(rcor_filt,na.rm = TRUE)

rcor_filt_melt_cutoff <- rcor_filt_melt[rcor_filt_melt$value>0.95,]
varR <- unique(rcor_filt_melt_cutoff$Var1)

plotHeatmap(input=vst_anno, 
            smp_table=sample_table,
           
            geneset = varR,
            title = "Heatmap of all variable and co-expressed genes",
            show_rownames = FALSE,
            cluster_cols = FALSE,
            plot_anno=plot_annotation,
            gene_type = "all")

7.10 Expression of selected genes across conditions

Plot the expression of selected genes across conditions.

plotSingleGene(data=norm_anno, 
                 symbol="ITGAM", 
                 condition="condition",
                 anno_colour=ann_colors$condition,
                 shape= NULL)+theme(axis.text.x = element_text(angle = 90, hjust = 1))
## [1] "Selected gene symbol assigned to one gene (Ensembl ID)"

plotSingleGene(data=norm_anno, 
                 symbol="WDR18", 
                 condition="condition",
                 anno_colour=ann_colors$condition,
                 shape= "donor")+theme(axis.text.x = element_text(angle = 90, hjust = 1))
## [1] "Selected gene symbol assigned to one gene (Ensembl ID)"

7.11 Heatmap of selected genes sets

Plot a heatmap of genes annotated to a selected GO, KEGG or Hallmark term.

plotGeneSetHeatmap(input=norm_anno,
                   smp_table = sample_table,
                   plot_anno = plot_annotation,
                  
                   cat="GO", 
                   term="neutrophil degranulation",
                   organism = "human",
                   show_rownames =F, 
                   cluster_cols = F,
                   gene_type = "all")

plotGeneSetHeatmap(input=norm_mean,
                   smp_table = mean_sample_table,
                   plot_anno = mean_plot_annotation,
             
                   cat="GO", 
                   term="neutrophil degranulation",
                   organism = "human",
                   show_rownames =F, 
                   cluster_cols = F,
                   gene_type = "protein_coding")

plotGeneSetHeatmap(input=norm_anno,
                   smp_table = sample_table,
                   plot_anno = plot_annotation,
                  
                   cat="KEGG", 
                   term="Alzheimer disease",
                   organism = "human",
                   show_rownames =T, 
                   cluster_cols = F,
                   gene_type = "all")

8 Batch Effect Removal (treat with caution!)

In case you observe a batch effect in the data, the following code can help to define and correct for known and unknown batch effects.

If you did not observe a batch effect, this section should be skipped.

8.1 LIMMA: Known batch effects

In case the global analysis showed that your data suffers from a technical batch effect, e.g. your samples cluster according to a certain covariate such as “sex” or “date of experiment”, you might want to correct for this. For known batch effects, such as sequencing day or sex, you may want to try to include the variance explained by this covariate in your later differential gene expression analysis as a co-factor.

To check whether removing the variance caused by this covariate improves the clustering of your samples, you can correct your gene expression for this factor using limma and plot a PCA of the corrected data.

(https://bioconductor.org/packages/release/bioc/html/limma.html)

removedbatch_dds_vst <- limmaBatchEffectRemoval(input=dds_vst,
                                                   modelfactor = "condition",
                                                   batchfactor = "donor",
                                                   batchfactor_2 = NULL)
plotPCA(pca_input = removedbatch_dds_vst,
        ntop="all", 
        xPC=1,
        yPC=2,
        color="condition",
        anno_colour = ann_colors$condition,
        shape="donor",
        point_size=3,
        title="PCA of batch-corrected counts",
        gene_anno=gene_annotation,
        gene_type="all")

8.2 Batch-corrected table

For later plotting of Heatmaps, it may be useful to have a dataframe of the batch-corrected values as log2-transformed (removedbatch_anno_log2) as well as as normal counts (removedbatch_anno).

removedbatch_anno_log2<-removedbatch_dds_vst
removedbatch_anno_log2<-merge(removedbatch_anno_log2,norm_anno[,c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")],by="row.names")
rownames(removedbatch_anno_log2)<-removedbatch_anno_log2$Row.names
removedbatch_anno_log2$Row.names<-NULL
removedbatch_anno<-removedbatch_dds_vst
removedbatch_anno<-2^removedbatch_anno
removedbatch_anno<-merge(removedbatch_anno,norm_anno[,c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")],by="row.names")
rownames(removedbatch_anno)<-removedbatch_anno$Row.names
removedbatch_anno$Row.names<-NULL

8.2.1 Heatmaps etc. to plot the batch-corrected count_matrix

8.3 Create means of vst_anno and removedbatch_anno_log2

removedbatch_mean_log2<-mean_function(input=removedbatch_anno_log2,
                                      anno = sample_table,
                                      condition = "condition")

corner(removedbatch_mean_log2)
##                                GENEID Immature_moDCs LPStreated_moDCs
## ENSG00000000003.14 ENSG00000000003.14      0.9663107        0.6747506
## ENSG00000000419.12 ENSG00000000419.12      8.1330584        8.2173440
## ENSG00000000457.13 ENSG00000000457.13      6.4973990        7.0915807
## ENSG00000000460.16 ENSG00000000460.16      5.1619892        4.7156306
## ENSG00000000938.12 ENSG00000000938.12     11.3824790        9.8007535
##                    Immature_IL10APCs LPStreated_IL10APCs
## ENSG00000000003.14         0.7022162           0.7023509
## ENSG00000000419.12         8.2592445           7.9952041
## ENSG00000000457.13         6.6757021           6.6051387
## ENSG00000000460.16         5.2226122           5.1857171
## ENSG00000000938.12        11.3644457          11.0233604

8.4 heatmaps with removed batch

8.4.1 all samples

plotHeatmap(input=vst_anno,
            smp_table = sample_table,
            plot_anno = plot_annotation,
            geneset = "all",
            title = "Heatmap of all present genes (variance stabilized counts [log2] )",
            show_rownames = FALSE,
            cluster_cols = TRUE,
            gene_type = "all")

plotHeatmap(input=removedbatch_anno_log2,
            smp_table = sample_table,
            plot_anno = plot_annotation,
            geneset = "all",
            title = "Heatmap of all present genes (batch-removed counts [log2])",
            show_rownames = FALSE,
            cluster_cols = TRUE,
            gene_type = "all")

8.4.2 means of conditions

plotHeatmap(input=vst_mean,
            smp_table = mean_sample_table,
            plot_anno = mean_plot_annotation,
            geneset = "all",
            title = "Heatmap of all present genes (variance stabilized counts [log2] )",
            show_rownames = FALSE,
            cluster_cols = TRUE,
            gene_type = "all")

plotHeatmap(input=removedbatch_mean_log2,
            smp_table = mean_sample_table,
            plot_anno = mean_plot_annotation,
            geneset = "all",
            title = "Heatmap of all present genes (batch-removed counts [log2])",
            show_rownames = FALSE,
            cluster_cols = TRUE,
            gene_type = "all")

8.5 Sample-to-Sample correlation & distance

cd_input_batch<-removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")]

8.5.1 Correlation

# Correlation based on variance-stabilized counts & batch-corrected counts

# sampleCor <- as.matrix(cor(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")], use="all.obs", method="pearson"))
# rownames(sampleCor)<- sample_table$ID
# colnames(sampleCor)<- sample_table$ID
# 
# pheatmap(sampleCor,
#          main="Sample Correlation based on batch-corrected counts",
#          annotation_row = plot_annotation,
#          annotation_col = plot_annotation,
#          annotation_colors = ann_colors,
#          cluster_rows = T,
#          cluster_cols = T,
#          fontsize = 8)

corr_function(sampleCor = cd_input_batch,
              title="Sample Correlation based on batch-corrected counts",
      
              gene_anno=gene_annotation,
              plot_anno=plot_annotation,
              gene_type="protein_coding",
              mean=F)

corr_function(sampleCor = cd_input_batch,
              title="Sample Correlation based on batch-corrected counts",
       
              gene_anno=gene_annotation,
              plot_anno=plot_annotation,
              gene_type="all",
              mean=F)

corr_function(sampleCor = cd_input_batch,
              title="Sample Correlation based on batch-corrected counts",
   
              gene_anno=gene_annotation,
              plot_anno=mean_plot_annotation,
              gene_type="all",
              mean=T)

8.5.2 Distance

# Sample Distances based on batch-corrected counts

# sampleDist <- as.matrix(dist(t(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")])))
# rownames(sampleDist)<- sample_table$ID
# colnames(sampleDist)<- sample_table$ID
# 
# pheatmap(sampleDist,
#          clustering_distance_rows = dist(t(assay(dds_vst))),
#          clustering_distance_cols = dist(t(assay(dds_vst))), 
#          main="Sample distances based on batch-corrected counts per sample",
#          annotation_row = plot_annotation, 
#          annotation_col = plot_annotation,
#          annotation_colors = ann_colors,
#          fontsize = 8)

dist_function(sampleDist = cd_input_batch,
         
              gene_anno=gene_annotation,
              plot_anno=plot_annotation,
              title="Sample distances based on batch-corrected counts per sample",
              gene_type="all",
              mean=F)

dist_function(sampleDist = cd_input_batch,
   
              gene_anno=gene_annotation,
              plot_anno=mean_plot_annotation,
              title="Sample distances based on batch-corrected counts per sample",
              gene_type="all",
              mean=T)

8.6 Expression of selected genes across conditions

Plot the expression of selected genes across conditions.

plotSingleGene(data=removedbatch_anno, 
                symbol="WDR18", 
                 condition="condition",
               anno_colour=ann_colors$condition,
               shape= "donor")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
## [1] "Selected gene symbol assigned to one gene (Ensembl ID)"

8.7 SVA: Unknown batch effects

In case your samples do not cluster according to the condition of biological interested, but you observe distinct clustering according to an unknown latent variable, you may want to try identifying this “hidden”" batch variable using surrogate variable analysis (SVA).

The SVA package helps to define the variance in the data set that is not explained by your variables of interest and tries to model the respective surrogate variables. These surrogate variables can be included as factors in your DESeq2 model. (http://master.bioconductor.org/packages/release/workflows/html/rnaseqGene.html#removing-hidden-batch-effects; http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161)

The goal of the sva is to remove all unwanted sources of variation while protecting the contrasts due to the primary variables. This leads to the identification of features that are consistently different between groups, removing all common sources of latent variation. In some cases, the latent variables may be important sources of biological vari- ability. If the goal of the analysis is to identify heterogeneity in one or more subgroups, the sva function may not be appropriate.

The first step in using the sva package is to properly format the data and create appropriate model matrices. The data should be a matrix with features (genes, transcripts, voxels) in the rows and samples in the columns. Below we obtain a matrix of normalized counts for which the average count across samples is larger than 1. This is the typical genes by samples matrix found in gene expression analyses. The sva package assumes there are two types of variables that are being considered: (1) adjustment variables and (2) variables of interest. For example, in a gene expression study the variable of interest might an indicator of cancer versus control. The adjustment variables could be the age of the patients, the sex of the patients, and a variable like the date the arrays were processed.

Two model matrices must be made: the “full model” and the “null model”. The null model is a model matrix that includes terms for all of the adjustment variables but not the variables of interest. The full model includes terms for both the adjustment variables and the variables of interest. The assumption is that you will be trying to analyze the association between the variables of interest and gene expression, adjusting for the adjustment variables. The model matrices can be created using the model.matrix() function.

After the model matrices have been created, we can apply the svaseq function to estimate batch and other artifacts.

The svaseq function performs two different steps. First it identifies the number of latent factors that need to be estimated. If the sva function is called without the n.sv argument specified, the number of factors will be estimated for you. If you prefer a specific number of SVs, you can set the n.sv parameter accordingly.

# Format and filter the input
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]

# Create the full model matrix - including both the adjustment variables and the variable of interest.  In this case we only have one variable of interest called Genotype_Age.
mod  <- model.matrix(~ condition, colData(dds))

# The null model contains only the adjustment variables. Since we are not adjusting for any other variables in this analysis, only an intercept is included in the model.
mod0 <- model.matrix(~   1, colData(dds))

Apply the svaseq() function to estimate the surrogate variables:

The svaseq function returns a list with four components, sv, pprob.gam, pprob.b, n.sv:

sv is a matrix whose columns correspond to the estimated surrogate variables. They can be used in downstream analyses as described below. pprob.gam is the posterior probability that each gene is associated with one or more latent variables. pprob.b is the posterior probability that each gene is associated with the variables of interest. n.sv is the number of surrogate variables estimated by the sva.

svseq <- svaseq(dat,
               mod,
               mod0)
## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5
svseq$sv
##                [,1]        [,2]
##  [1,]  0.0004899811 -0.44975950
##  [2,] -0.1637062597 -0.20697600
##  [3,] -0.0656291476 -0.46170196
##  [4,] -0.0884280508 -0.42728237
##  [5,] -0.3422290945  0.37945345
##  [6,] -0.3188615390  0.28345666
##  [7,] -0.2814525092  0.15177991
##  [8,] -0.2712153647  0.18149453
##  [9,]  0.3800205381  0.16048612
## [10,]  0.4188261910  0.16006204
## [11,]  0.3946231423  0.08964198
## [12,]  0.3375621129  0.13934513

You can display whether the calculated SV correlate with any of your known covariates by changing dds$condition to the column of your sample_table that you are interested in.

par(mfrow = c(2, ncol(svseq$sv)))

condition ="donor"
for (i in 1:ncol(svseq$sv)) {
  stripchart(svseq$sv[, i] ~ dds[[condition]], vertical = TRUE, main = paste0("SV", i))
  abline(h = 0)
}

Add surrogate variables to annotation table to re-analyse the data including the surrogate variables in the analysis.

for (i in 1:ncol(svseq$sv)) {
  sample_table[[paste0("SV",i)]]<- svseq$sv[,i]
}

Again, we can check in a PCA what an influence removing the SVs has on the clustering of the samples.

removedbatch_dds_vst <- limmaBatchEffectRemoval(input=dds_vst,
                                                modelfactor = "condition",
                                                batchfactor = c("SV1","SV2","SV3"),
                                                batchfactor_2 = NULL)

plotPCA(pca_input = removedbatch_dds_vst,
        ntop="all",
        xPC=1,
        yPC=2,
        color="condition",
        anno_colour = ann_colors$condition,
        point_size=3,
        title="PCA of batch-corrected counts",
        gene_anno=gene_annotation,
        gene_type="all")

You can also use the batch-corrected table as input for all heatmaps and as input for the top varying genes as displayed in the following section:

8.7.0.1 Heatmap of all genes

plotHeatmap(input=removedbatch_anno_log2,
            smp_table = sample_table,
            plot_anno = plot_annotation,
           
            gene_type = "all",
            geneset = "all",
            title = "Heatmap of all present genes",
            show_rownames = FALSE,
            cluster_cols = TRUE)

8.7.0.2 Hierarchical Clustering of most variable genes

rv = genefilter::rowVars(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")])
q75 = quantile(rowVars(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")]), .75)
q75_names = names(which(rv > q75))
# clustered Columns & Rows
plotHeatmap(input=removedbatch_anno_log2,
            geneset = q75_names,
            smp_table = sample_table,
            plot_anno = plot_annotation,
           
            gene_type = "all",
            title = "Heatmap of all most variable genes",
            show_rownames = FALSE,
            cluster_cols = TRUE)

8.8 Sample-to-Sample correlation & distance

cd_batch_log2<-removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")]

8.8.0.1 Sample-to-Sample correlation & distance

# sampleCor <- as.matrix(cor(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")], use="all.obs", method="pearson"))
# rownames(sampleCor)<- sample_table$ID
# colnames(sampleCor)<- sample_table$ID
# pheatmap(sampleCor,
#          main="Sample Correlation based on variance-stabilized counts",
#          annotation_row = plot_annotation,
#          annotation_col = plot_annotation,
#          annotation_colors = ann_colors,
#          cluster_rows = F,
#          cluster_cols = F,
#          fontsize = 8)

corr_function(sampleCor = cd_batch_log2,
               title="Sample Correlation based on variance-stabilized counts",
               gene_anno=gene_annotation,
               plot_anno=plot_annotation,
               gene_type="all",
               cluster_rows = F,
               cluster_cols = F,
               mean=F)

# sampleDist <- as.matrix(dist(t(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")])))
# rownames(sampleDist)<- sample_table$ID
# colnames(sampleDist)<- sample_table$ID
# pheatmap(sampleDist,
#          clustering_distance_rows = dist(t(assay(dds_vst))),
#          clustering_distance_cols = dist(t(assay(dds_vst))), 
#          main="Sample distances based on variance-stabilized counts per sample",
#          annotation_row = plot_annotation, 
#          annotation_col = plot_annotation,
#          annotation_colors = ann_colors,
#          fontsize = 8)

dist_function(sampleDist = cd_batch_log2,
      
              gene_anno=gene_annotation,
              plot_anno=plot_annotation,
              title="Sample distances based on variance-stabilized counts per sample",
              gene_type="all",
              mean=F)

q75_vst = removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")][rowVars(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")]) > quantile(rowVars(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")]), .75),]
rcor <- rcorr(t(q75_vst),type="pearson")
rcor$sig <- rcor$P<0.05 & rcor$r>0 # define significant positive correlations
rcor_filt <- rcor$r*rcor$si
rcor_filt <- rcor_filt*upper.tri(rcor_filt)
rcor_filt<- replace(rcor_filt, which( rcor_filt==0), NA)
rcor_filt_melt <- melt(rcor_filt,na.rm = TRUE)
rcor_filt_melt_cutoff <- rcor_filt_melt[rcor_filt_melt$value>0.95,]
varR <- unique(rcor_filt_melt_cutoff$Var1)

plotHeatmap(input=removedbatch_anno_log2,
            geneset = varR,
            smp_table = sample_table,
            plot_anno = plot_annotation,
           
            gene_type = "all",
            title = "Heatmap of all variable and co-expressed genes",
            show_rownames = FALSE,
            cluster_cols = FALSE)

8.9 Include batch effect variables into DESeq2 model

In case you want to include the observed batch variables in your DESeq2 model, no matter if known covariates or surrogate variables, you can add these to your design formlua in front of the condition of interest and recalculate your dds object. In the following expample we will include a Sex and three SVs as batch effects.

ATTENTION: Skip this chunk, if you do not want to include any batch variables!

# ddssva <- dds
# ddssva$SV1 <- svseq$sv[,1]
# ddssva$SV2 <- svseq$sv[,2]
# ddssva$SV3 <- svseq$sv[,3]
# design(ddssva) <- ~ SV1 + SV2 + SV3 + Genotype_Age
# 
# dds <- DESeq(ddssva)

9 Differential Expression Analysis

After the DESeq() function performs the standard differential expression analysis steps, DESeq2’s results() function can extract a result table from the DESeqDataSet giving base means across samples, log2 fold changes, standard errors, test statistics, p-values and adjusted p-values.

We have written a function called DEAnalysis() that runs DESeq2’s results() and the lfcShrink() function with specified parameters on a set of pre-defined comparisons and returns a single DEresults object containing the respective result tables together with additional lists of the significant DE genes and the number of DE genes found.

The parameters of the DEAnalysis function are:

  1. condition: specify the condition that you are testing, e.g. treatment or genotype. This value must correspond to the column in the colData listing the factors you are comparing and the design formula.

  2. alpha: a significance cutoff used for optimizing the independent filtering.(default= 0.05)

  3. lfcThreshold: a non-negative value which specifies a log2 fold change threshold. The default value is 0, corresponding to a test that the log2 fold changes are equal to zero. (default = 0)

  4. sigFC: post testing significance criteria as a non-negative, non-log transformed FC cutoff (default = 2)

  5. multiple_testing: By default independent hypothesis weighting will be used as the multiple testing method (https://bioconductor.org/packages/3.7/bioc/vignettes/IHW/inst/doc/introduction_to_ihw.html). However, you can also edit the multiple testing method by setting the multiple_testing parameter to “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”,or “fdr”, which will perform independent filtering and p-value adjustment according to the specified method. (default = “IHW”)

  6. shrinkage: After calculating differential expression statistics, we can perform a so-called log2 fold change shrinkage. This shrinkage of effect size (LFC estimates) is useful for visualization and ranking of genes, as it removes the noise associated with log2 fold changes from low count genes without requiring arbitrary filtering thresholds. To shrink the LFC, set shrinkage = TRUE to pass the dds object to the function lfcShrink. (default = TRUE)

  7. shrinkType: The options for the shrinkage type are:

    • “normal” is the the original DESeq2 shrinkage estimator, an adaptive Normal distribution as prior.This is currently the default, although the default will likely change to apeglm in the October 2018 release given apeglm’s superior performance.

    • “apeglm” is the adaptive t prior shrinkage estimator from the apeglm package (Zhu, Ibrahim, and Love 2018).

    • “ashr” is the adaptive shrinkage estimator from the ashr package (Stephens 2016). Here DESeq2 uses the ashr option to fit a mixture of Normal distributions to form the prior, with method=“shrinkage”.

9.1 Define relevant comparisons

Define your comparisons in a data.frame with “comparison” in the first column and “control” in the second column.

comparison control
LPStreated_moDCs Immature_moDCs
Immature_IL10APCs Immature_moDCs
LPStreated_IL10APCs Immature_IL10APCs
comparison_table<-data.frame(comparison = c("LPStreated_moDCs","Immature_IL10APCs","LPStreated_IL10APCs"),
                             control = c("Immature_moDCs","Immature_moDCs","Immature_IL10APCs"))

9.2 Perform Differential Expression Testing

DEresults <- DEAnalysis(condition = "condition",
                        alpha=0.05 ,
                        lfcThreshold= 0,
                        sigFC = 2,
                        multiple_testing="IHW",
                        shrinkage = TRUE,
                        shrinkType="normal")
## using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).
## 
## Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
## See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
## Reference: https://doi.org/10.1093/bioinformatics/bty895
## using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).
## 
## Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
## See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
## Reference: https://doi.org/10.1093/bioinformatics/bty895
## using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).
## 
## Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
## See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
## Reference: https://doi.org/10.1093/bioinformatics/bty895

9.2.1 Summary of DE genes

We use a for loop to print the number of significantly up- and down-regulated genes over all comparisons.

DEcounts <- NULL

for(i in 1:nrow(comparison_table)){
  tmp <- unlist(DEresults[[1+i]]@Number_DE_genes)
  DEcounts <- rbind(DEcounts, tmp)
}

rownames(DEcounts) <- names(DEresults)[-1]

DEcounts
##                                          up_regulated_Genes
## LPStreated_moDCs_vs_Immature_moDCs                     1770
## Immature_IL10APCs_vs_Immature_moDCs                     739
## LPStreated_IL10APCs_vs_Immature_IL10APCs                137
##                                          down_regulated_Genes
## LPStreated_moDCs_vs_Immature_moDCs                       1711
## Immature_IL10APCs_vs_Immature_moDCs                       680
## LPStreated_IL10APCs_vs_Immature_IL10APCs                   30

9.3 8.1 General DE Gene analysis

9.3.1 Hierarchical Clustering of the union of DE genes

# the uDEG() function produces the union of the DE genes from the specified comparisons
allDEgenes <- uDEG(comparisons=c("LPStreated_moDCs_vs_Immature_moDCs",
                                 "Immature_IL10APCs_vs_Immature_moDCs",
                                 "LPStreated_IL10APCs_vs_Immature_IL10APCs"))

plotHeatmap(geneset = allDEgenes,
            smp_table = sample_table,
            plot_anno = plot_annotation,
           
            gene_type = "all",
            title = "Heatmap of all differentially expressed genes",
            show_rownames = FALSE,
            cluster_cols = TRUE)

9.3.2 Hierarchical Clustering of differentially expressed Transcription Factors

if(organism=="mouse"){
  DE_TF <- allDEgenes[which(allDEgenes %in% norm_anno[norm_anno$SYMBOL %in% TFlist_mm,]$GENEID)]
} else if(organism =="human"){
  DE_TF <- allDEgenes[which(allDEgenes %in% norm_anno[norm_anno$SYMBOL %in% TFlist_hs,]$GENEID)]
}

plotHeatmap(geneset = DE_TF,
            smp_table = sample_table,
            plot_anno = plot_annotation,
           
            gene_type = "all",
            title = "Heatmap of all differentially expressed transcription factors",
            show_rownames = TRUE,
            cluster_cols = FALSE)

9.3.3 Venn diagrams

plotVenn(comparisons =  c("LPStreated_moDCs_vs_Immature_moDCs",
                          "Immature_IL10APCs_vs_Immature_moDCs"),
         regulation ="up")

plotVenn(comparisons =  c("LPStreated_moDCs_vs_Immature_moDCs", 
                          "Immature_IL10APCs_vs_Immature_moDCs"))

9.3.4 Ratio-Ratio plots of DE genes

Ratio-Ratio plots help to compare the DE genes of two comparisons and are much more expressive than venn diagrams.

plotRatios(comp1 = "LPStreated_moDCs_vs_Immature_moDCs", 
           comp2 = "Immature_IL10APCs_vs_Immature_moDCs")

9.3.5 Fold change rank plots

plotFCrank(comp1 = "LPStreated_moDCs_vs_Immature_moDCs", 
           comp2 = "Immature_IL10APCs_vs_Immature_moDCs")

9.3.6 GSEA across comparisons

Define universe and gene sets for subsequent GSEA analyses.

# define universe
universe <- as.character(norm_anno$SYMBOL)
# change symbols to ENTREZ IDs (necessary for ClusterProfiler)
universe_Entrez <- bitr(universe, 
                        fromType="SYMBOL", 
                        toType="ENTREZID", 
                        OrgDb="org.Hs.eg.db")$ENTREZID
## 'select()' returned 1:many mapping between keys and columns
## Warning in bitr(universe, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## "org.Hs.eg.db"): 24.2% of input gene IDs are fail to map...
universe_mouse2human <- getLDS(attributes = c("mgi_symbol"), 
                              filters = "mgi_symbol", 
                              values = universe, 
                              mart = useMart("ensembl", dataset = "mmusculus_gene_ensembl"), 
                              attributesL = c("hgnc_symbol"), 
                              martL = useMart("ensembl", dataset = "hsapiens_gene_ensembl"), 
                              uniqueRows=T)[,2]

universe_mouse2human_Entrez <- bitr(universe_mouse2human, 
                        fromType="SYMBOL", 
                        toType="ENTREZID", 
                        OrgDb="org.Hs.eg.db")$ENTREZID
## 'select()' returned 1:many mapping between keys and columns
## Warning in bitr(universe_mouse2human, fromType = "SYMBOL", toType =
## "ENTREZID", : 0.22% of input gene IDs are fail to map...

Next, we perform functional enrichment analysis based on Gene ontology and KEGG pathway enrichment across all comparisons tested to check for overlap in functional indications of the differentially regulated genes.

DEcompare <- compareGSEA(comparisons = c("LPStreated_moDCs_vs_Immature_moDCs", 
                                         "Immature_IL10APCs_vs_Immature_moDCs",
                                         "LPStreated_IL10APCs_vs_Immature_IL10APCs"),
                         organism = organism,
                         GeneSets = c("GO", "KEGG"),
                         pCorrection = "bonferroni", 
                         pvalueCutoff = 0.05,
                         qvalueCutoff = 0.05,
                         showMax = 20, 
                         ontology = "BP")
## Warning in bitr(DE_up, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 9.24% of input gene IDs are fail to map...
## Warning in bitr(DE_down, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 8.66% of input gene IDs are fail to map...
## Warning in bitr(DE_up, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 7.32% of input gene IDs are fail to map...
## Warning in bitr(DE_down, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 5.9% of input gene IDs are fail to map...
## Warning in bitr(DE_up, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 3.68% of input gene IDs are fail to map...
## [1] "Performing GO enrichment"
## [1] "Performing KEGG enrichment"
DEcompare$GOplot+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

DEcompare$KEGGplot+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

9.3.7 Likelihood ratio test

DESeq2 offers two kinds of hypothesis tests: the Wald test, where we use the estimated standard error of a log2 fold change to test if it is equal to zero, and the likelihood ratio test (LRT). The LRT examines two models for the counts, a full model with a certain number of terms and a reduced model, in which some of the terms of the full model are removed. The test determines if the increased likelihood of the data using the extra terms in the full model is more than expected if those extra terms are truly zero.

The LRT is therefore useful for testing multiple terms at once, for example testing 3 or more levels of a factor at once, or all interactions between two variables. The LRT for count data is conceptually similar to an analysis of variance (ANOVA) calculation in linear regression, except that in the case of the Negative Binomial GLM, we use an analysis of deviance (ANODEV), where the deviance captures the difference in likelihood between a full and a reduced model.

The likelihood ratio test can be performed by specifying test=“LRT” when using the DESeq function, and providing a reduced design formula, e.g. one in which a number of terms from design(dds) are removed. The degrees of freedom for the test is obtained from the difference between the number of parameters in the two models. A simple likelihood ratio test, if the full design was ~Genotype_Age would look like:

dds_lrt <- DESeq(dds, 
                 test="LRT", 
                 reduced=~1)
## using pre-existing normalization factors
## estimating dispersions
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res_lrt <- results(dds_lrt)

lrt_anno<- as.data.frame(assay(dds_lrt)) 

# add gene annotation extracted from the gtf file 
gene_annotation <- gene_annotation[match(rownames(lrt_anno), gene_annotation$GENEID),]

# check if row names of the lrt_anno and the gene annotation match perfectly
all(rownames(lrt_anno) == gene_annotation$GENEID)
## [1] TRUE
lrt_anno$GENEID<-rownames(lrt_anno) 
lrt_anno <- merge(lrt_anno,
                   gene_annotation,
                   by = "GENEID") 

rownames(lrt_anno)<-lrt_anno$GENEID
lrt_anno[1:3,c(1:2, (ncol(lrt_anno)-5):ncol(lrt_anno))]
##                                GENEID 1553 1567 1568 SYMBOL       GENETYPE
## ENSG00000000003.14 ENSG00000000003.14    0    1    1 TSPAN6 protein_coding
## ENSG00000000419.12 ENSG00000000419.12  182  282  381   DPM1 protein_coding
## ENSG00000000457.13 ENSG00000000457.13   48   88  131  SCYL3 protein_coding
##                                                                                                       DESCRIPTION
## ENSG00000000003.14                                              tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## ENSG00000000419.12 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
## ENSG00000000457.13                                   SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
##                    CHR
## ENSG00000000003.14   X
## ENSG00000000419.12  20
## ENSG00000000457.13   1

Creates mean for lrt_anno

lrt_mean<-mean_function(input=lrt_anno, 
                         anno=sample_table,
                         condition="condition")

9.4 Heatmap of genes with likelihood ratio test p value below 0.05

# Plot heatmap of significantly different genes 
lrt_significantGenes <- rownames(res_lrt[!is.na(res_lrt$padj)&
                                           res_lrt$padj < 0.05 ,])
plotHeatmap(input=norm_anno,
            smp_table = sample_table,
            plot_anno = plot_annotation,
           
            geneset = lrt_significantGenes,
            title = paste("Heatmap of all LRT-significant genes:",  length(lrt_significantGenes),"(normalized values)", sep=""),
            show_rownames = FALSE,
            cluster_cols = T,
            gene_type = "all")

plotHeatmap(input=norm_mean,
            smp_table = mean_sample_table,
            plot_anno = mean_plot_annotation,
       
            geneset = lrt_significantGenes,
            title = paste("Heatmap of all LRT-significant genes:",  length(lrt_significantGenes),"(normalized values)", sep=""),
            show_rownames = FALSE,
            cluster_cols = TRUE,
            gene_type = "all")

# Question to Steffi & Jonas: 
#Do we want to plot the normalized counts for the heatmap or the LRT results(IDK in how far they vary from the normalization)

plotHeatmap(input=lrt_anno,
            smp_table = sample_table,
            plot_anno = plot_annotation,
           
            geneset = lrt_significantGenes,
            title = paste("Heatmap of all LRT-significant genes:",  length(lrt_significantGenes),"(LRT-values)", sep=""),
            show_rownames = FALSE,
            cluster_cols = T,
            gene_type = "all")

plotHeatmap(input=lrt_mean,
            smp_table = mean_sample_table,
            plot_anno = mean_plot_annotation,
     
            geneset = lrt_significantGenes,
            title = paste("Heatmap of all LRT-significant genes:",  length(lrt_significantGenes),"(LRT-values)", sep=""),
            show_rownames = FALSE,
            cluster_cols = TRUE,
            gene_type = "all")

# Plot heatmap of 1000 top significant genes
lrt_1000mostvariableGenes <- head(rownames(res_lrt[order(res_lrt$padj),]),1000)

plotHeatmap(input=norm_anno,
            smp_table = sample_table,
            plot_anno = plot_annotation,
           
            geneset = lrt_1000mostvariableGenes,
            title = "Heatmap of 1000 top significant LRT genes",
            show_rownames = FALSE,
            cluster_cols = TRUE,
            gene_type = "all")

plotHeatmap(input=norm_mean,
            smp_table = mean_sample_table,
            plot_anno = mean_plot_annotation,

            geneset = lrt_1000mostvariableGenes,
            title = "Heatmap of 1000 top significant LRT genes",
            show_rownames = FALSE,
            cluster_cols = TRUE,
            gene_type = "all")

9.5 8.2 Specific DE Gene analysis

Now that we have had a global look at the differentially expressed genes, we want to have a more detailed look at specific comparisons. Therefore, we use various plots to get an impression of the differences in the specified comparisons.

9.5.1 MAplots

A MAplot represents the average log expression versus the average ratio (or fold change) between two conditions.

  plotMA(comparison = "LPStreated_moDCs_vs_Immature_moDCs",
         ylim=c(-2,2))

9.5.2 p value distributions

We plot a histogram of the p-values resulting from testing the specified comparisons.

To understand p-value distributions, this source is very helpful: http://varianceexplained.org/statistics/interpreting-pvalue-histogram/

  plotPvalues(comparison="LPStreated_moDCs_vs_Immature_moDCs")

9.5.3 Heatmaps of DE genes

Plot heatmaps of the expression of the DE genes of the specified comparisons. The argument conditions allows to include samples from specified conditions in the heatmap.

plotDEHeatmap(comparison = "LPStreated_moDCs_vs_Immature_moDCs",
              factor="condition",
              conditions="all",
              gene_type="all",
              show_rownames = FALSE,
              cluster_cols = F)

plotDEHeatmap(comparison = "Immature_IL10APCs_vs_Immature_moDCs",
              factor="condition",
              conditions=c("Immature_moDCs","Immature_IL10APCs", "LPStreated_moDCs"),
              gene_type="all",
              show_rownames = FALSE,
              cluster_cols = F)

9.5.4 Volcano Plots

A volcano plot is a type of scatter-plot that is used to quickly identify changes in large data sets composed of replicate data. It plots significance versus fold-change on the y and x axes, respectively.

# Plot Volcano Plot
plotVolcano(comparison= "LPStreated_moDCs_vs_Immature_moDCs",
            labelnum=20)

9.5.5 Gene Set Enrichment of DE genes

Next, we perform gene set enrichment analyses of the DE genes of the specified comparisons.

6 gene sets are available: * GO * KEGG * MSigDB Hallmark (H) * MSigDB Cannonical Pathways (C2) Gene sets from pathway databases. Usually, these gene sets are canonical representations of a biological process compiled by domain experts.
* MSigDB Motifs (C3, DNA binding motifs) Gene sets representing potential targets of regulation by transcription factors or microRNAs. The sets consist of genes grouped by short sequence motifs they share in their non-protein coding regions. The motifs represent known or likely cis-regulatory elements in promoters and 3’-UTRs. The C3 collection is divided into two sub-collections: MIR and TFT
* MSigDB Immunologic gene sets (C7) Gene sets that represent cell states and perturbations within the immune system. The signatures were generated by manual curation of published studies in human and mouse immunology.

GSEA_LPStreated_moDCs_vs_Immature_moDCs <-  GSEA(comparison="LPStreated_moDCs_vs_Immature_moDCs",
                                 organism = "human",
                                 GeneSets = c("GO",
                                              "KEGG",
                                              "Hallmark",
                                              "canonicalPathways",
                                              "Motifs",
                                              "ImmunoSignatures"),
                                 GOntology = "BP",
                                 pCorrection = "bonferroni", # choose the p-value adjustment method
                                 pvalueCutoff = 0.05, # set the unadj. or adj. p-value cutoff (depending on correction method)
                                 qvalueCutoff = 0.05, # set the q-value cutoff (FDR corrected)
                                 showMax = 20,
                                       font.size = 8)
## 'select()' returned 1:1 mapping between keys and columns
## Warning in bitr(DE_up, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 9.24% of input gene IDs are fail to map...
## 'select()' returned 1:1 mapping between keys and columns
## Warning in bitr(DE_down, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 8.66% of input gene IDs are fail to map...
## [1] "Performing GO enrichment"
## wrong orderBy parameter; set to default `orderBy = "x"`
## wrong orderBy parameter; set to default `orderBy = "x"`
## [1] "Performing KEGG enrichment"
## wrong orderBy parameter; set to default `orderBy = "x"`
## wrong orderBy parameter; set to default `orderBy = "x"`
## 'select()' returned 1:1 mapping between keys and columns
## Warning in bitr(genes_up_hsa, fromType = "SYMBOL", toType = "ENTREZID", :
## 0.08% of input gene IDs are fail to map...
## 'select()' returned 1:1 mapping between keys and columns
## Warning in bitr(genes_down_hsa, fromType = "SYMBOL", toType = "ENTREZID", :
## 0.22% of input gene IDs are fail to map...
## [1] "Performing Hallmark enrichment"
## wrong orderBy parameter; set to default `orderBy = "x"`
## wrong orderBy parameter; set to default `orderBy = "x"`
## [1] "Performing Motif enrichment"
## wrong orderBy parameter; set to default `orderBy = "x"`
## [1] "Performing immunesignature enrichment"
## wrong orderBy parameter; set to default `orderBy = "x"`
## wrong orderBy parameter; set to default `orderBy = "x"`
GSEA_LPStreated_moDCs_vs_Immature_moDCs$GOup_plot

GSEA_LPStreated_moDCs_vs_Immature_moDCs$KEGGup_plot

GSEA_LPStreated_moDCs_vs_Immature_moDCs$HALLMARKup_plot

GSEA_LPStreated_moDCs_vs_Immature_moDCs$ImmSigup_plot

9.5.5.1 Heatmaps of GSEA result responsible genes

Please make sure that the specified terms are enriched terms of the selected results.

plotGSEAHeatmap(GSEA_result = GSEA_LPStreated_moDCs_vs_Immature_moDCs$GOdown, 
              GeneSet="GO", 
              term = "SRP-dependent cotranslational protein targeting to membrane",
              show_rownames = TRUE,
                cluster_cols = F,
                gene_type = "all")

plotGSEAHeatmap(GSEA_result = GSEA_LPStreated_moDCs_vs_Immature_moDCs$GOup, 
                GeneSet="GO", 
                term = "response to virus",
                show_rownames = TRUE,
                cluster_cols = F,
                gene_type = "protein_coding")

plotGSEAHeatmap(input=norm_mean,
                smp_table = mean_sample_table,
                plot_anno = mean_plot_annotation, 
                GSEA_result = GSEA_LPStreated_moDCs_vs_Immature_moDCs$GOup, 
                GeneSet="GO", 
                term = "response to virus",
                show_rownames = TRUE,
                cluster_cols = F,
                gene_type = "all")

10 Export of the results

10.1 Count table as txt file

Exporting the annotated, normalized expression table:

write.table(norm_anno,
            paste(dir, "/Analysis/", "Tables/", "DESeq2_norm_anno_", Sys.Date(), ".txt", sep = ""),
            sep = "\t",
            quote = F,
            row.names = F)

10.2 Export as Excel sheet

As an optimal output for cooperation partners, we create an Excel workbook with the normalized count table, the rlog-transformed counts, the DE test parameters & the statistics of the respective differential expression tests.

# create Workbook
ExcelOutput<-createWorkbook()

# add sample table
sheet <- addWorksheet(ExcelOutput, sheetName = "Samples")
writeDataTable(ExcelOutput, sheet, sample_table, withFilter=FALSE)

# add normalized counts
tmp <- norm_anno
tmp$LRT_padj <- res_lrt$padj
sheet <- addWorksheet(ExcelOutput, sheetName = "Normalized counts & Annotation")
writeDataTable(ExcelOutput, sheet, tmp, withFilter=FALSE)

# add variance-stabilized counts
tmp <- as.data.frame(assay(dds_vst))
tmp$GENEID <- norm_anno$GENEID
tmp$SYMBOL <- norm_anno$SYMBOL
sheet <- addWorksheet(ExcelOutput, sheetName = "Variance-stabilized counts")
writeDataTable(ExcelOutput, sheet, tmp, withFilter=FALSE)

# add DE test parameters
tmp <- stack(unlist(DEresults$parameters))
colnames(tmp)<-c("value","parameter")
tmp <- rbind(tmp, data.frame(value = as.character(design(dds))[2], parameter = "design"))
sheet <- addWorksheet(ExcelOutput, sheetName = "DE parameters")
writeDataTable(ExcelOutput, sheet, tmp, withFilter=FALSE)

# add combined DE results
cResults <- NULL
for (i in 2:length(names(DEresults))) {
  if(i == 2){
    cResults<-data.frame(DEresults[[i]]@results[,c("GENEID","SYMBOL","baseMean","log2FoldChange","padj","regulation")])
    colnames(cResults)[4:6]<- paste(names(DEresults)[i],colnames(cResults)[4:6],sep="_")
  }else{
    tmp<-data.frame(DEresults[[i]]@results[,c("log2FoldChange","padj","regulation")])
    colnames(tmp)<- paste(names(DEresults)[i],colnames(tmp),sep="_")    
    cResults<-cbind(cResults,tmp)
  }
}
sheet <- addWorksheet(ExcelOutput, sheetName = "combined DE Results")
writeDataTable(ExcelOutput, sheet, cResults, withFilter=FALSE)

# add DE results in single sheets
for(i in 2:length(names(DEresults))){
  gc()
  sheet <- addWorksheet(ExcelOutput, sheetName = substr(names(DEresults[i]),1 , 30))
  writeDataTable(ExcelOutput, sheet, DEresults[[i]]@results, withFilter=FALSE)
  sheet <- addWorksheet(ExcelOutput, sheetName = paste(substr(names(DEresults[i]),1 , 22),"_upDEGs"))
  writeDataTable(ExcelOutput, sheet, DEresults[[i]]@DE_genes$up_regulated_Genes, withFilter=FALSE)
  sheet <- addWorksheet(ExcelOutput, sheetName = paste(substr(names(DEresults[i]),1 , 20),"_downDEGs"))
  writeDataTable(ExcelOutput, sheet, DEresults[[i]]@DE_genes$down_regulated_Genes, withFilter=FALSE)
  }

# Save Workbook
filename <- paste(dir,"/Analysis/Tables/AnalysisOutput_",gsub(":","-",as.character(Sys.time())),".xlsx",sep="")
saveWorkbook(ExcelOutput, file=filename)

11 Save image and session Info

save.image(paste(dir, "/Analysis/", Sys.Date(), "_Image.RData", sep = ""))
sessionInfo()